ABSTRACT

Definition of the issues related to the use of latent trait
models, specifically one- and three-parameter logistic models, in
conjunction with multi-level achievement batteries, forms the basis
of this paper. Research results related to these issues are also
documented in an attempt to provide a rational basis for model
selection. The application of the latent trait models is evaluated in
terms of: (1) the assistance they can lend to test construction; (2)
their use in the vertical equating of test scores; and (3) their use
in comparing score scales. It is suggested that the target
information function of the latent trait models can give control over
test construction and precision, especially with the three-parameter
model; but there appears to be no acceptable way to select test
items using the one-parameter model. The latter model is criticized
for perpetuating the idea that all items are equal. Even when the
test battery is sorted into unidimensional subtests, it is reported
that neither the one- nor the three-parameter model is adequate for
vertical equating purposes. It is concluded that both models have
advantages and disadvantages. The recommendation that both models be
used in conjunction with traditional procedures is suggested.
(AEP)

In the two decades since 1960, a revolution has taken place in the areas
of test theory and its applications. The one- and three-parameter logistic
models that were obscure curiosities in 1960 have become widely researched,
sometimes applied, and often debated components of measurement technology.
Computer programs have been developed to determine estimates of the parameters
of these models (Wright & Panchapakesan, 1968; Wood, Wingersky & Lord, 1976),
place the estimates on the same scale (Reckase, 1979), and estimate the ability
of individuals for sets of items (Owen, 1976). The one-parameter model (Rasch)
has been applied to reading tests (Woodcock, 1974; Rentz & Bashaw, 1977) and
to state testing programs (Forbes, 1976). The three-parameter model has had
fewer but similar applications (Cowell, 1979; Lord, 1968).

Because of the growing acceptance of latent trait models as an improve- 
ment over traditional procedures, serious thought is now being given to ap- 
plying the various latent trait models to major testing programs. The one- 
parameter model is being considered for use in the analysis and equating of 
the widely used Stanford Achievement Test, and the three-parameter logistic
model is being considered as the basis for tailored testing administration of 
the Armed Services Vocational Aptitude Battery (ASVAB). Unfortunately, these 
procedures are often accepted for use without a thorough evaluation of compet- 
ing procedures. When competing procedures are considered, the selection is often 
made on the basis of philosophical distinctions rather than empirical
evaluations. The purpose of this paper is to review the data available related to
the application of one- and three-parameter logistic models and to try to give
some rational basis for model selection. The particular orientation used in
the evaluation of models will be towards use in the development of multi-level 
achievement batteries. 



Characteristics Required of the Measurement Models

Before trying to compare the statistical models that might be used in the
development and application of multi-level achievement test batteries, some
consideration must be given to what is required in the development and use of
such batteries. Three general areas were defined that require the use of test
analysis procedures. They include (a) assistance in test construction, (b)
vertical equating of test forms, and (c) formation of a multi-level score scale.
The test construction component can be further subdivided into (a) formation of
an item pool, (b) item selection, and (c) evaluation of test quality. In order
to make recommendations concerning which model to use with multi-level
achievement tests, the applications of the models to each of these areas will
be reviewed and recommendations developed. These area recommendations will
then be synthesized and a global recommendation given.



Evaluation of the Applications of Latent Trait Models 
Assistance in test construction 

The first step in the construction of any test is usually the production
and tryout of a large number of test items. The traditional analyses performed
on the tryout data are the computation of item analysis statistics such as
difficulty and discrimination indexes. The sheer number of items required in a
multi-level achievement battery makes it impossible for all items to be
administered to all subjects. This fact unfortunately means that the item
statistics from different groups are not comparable. Latent trait theory
approaches overcome this lack of comparability by making it relatively easy to
translate item statistics determined using different groups onto the same
scale. The process has been labeled "linking" in the current literature.

Very little has been done to evaluate linking procedures in the latent trait
research literature. One paper by Reckase (1979), however, has discussed some
of the basic issues in item pool linking. In that paper, various techniques for
item pool linking were reviewed, and linked item calibration results from a
series of 50-item tests were compared to the calibration of the full 357 items
available, as calibrated on a sample of 4,000 examinees. Sample size and number
of common items in the linked tests were variables of interest in the study. The
results generally showed that the one-parameter model could be used to link
calibrations fairly well if five or more items were in common between tryout
tests and if samples of 300 or more were used. For the three-parameter model,
sample sizes of 1,000 or more were required for acceptable linking, and the
major axis method with the ANCILLES calibration program (Urry, 1978) or
maximum-likelihood linking with the LOGIST program (Wood, Wingersky & Lord,
1976) was found to yield the best results. The three-parameter procedure
requires a larger sample size because three parameters need to be estimated and
placed on the same scale, rather than just one parameter. If sample size were
the only issue, the one-parameter model would be selected at this point because
it gives adequate results using fewer cases. However, sample size is not the
only issue.

Once an item pool is generated and tried out, the next step in the construct- 
ion of the test battery is the selection of the actual test items to be used on 
the final forms. The traditional procedure used to select items is to pick the 
items with high discrimination indices and with item difficulty in the appropriate 
range. 
Essentially the same process can be followed if the three-parameter logistic
model is the basis for item calibration. Items with high discrimination
parameter estimates and appropriate difficulty parameter estimates can be
selected for use on the tests. Alternatively, a target information function can
be specified and then items can be selected so that their item information
functions sum to the target function. Lord (1977) describes this process in
more detail.
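
The selection against a target information function described above can be
sketched in a few lines. This is only an illustration, not the procedure given
by Lord (1977): the greedy rule, the scaling constant D = 1.7, the function
names, and the small item pool are all assumptions made for the example. The
item information formula is the standard one for the three-parameter logistic
model.

```python
import math

D = 1.7  # conventional scaling constant (an assumption; the paper does not state it)


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information at ability theta for the three-parameter logistic model."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_items(pool, grid, target, max_items):
    """Greedily pick items that fill the gap between current and target information."""
    info = [[item_information(t, *item) for t in grid] for item in pool]
    chosen, total = [], [0.0] * len(grid)
    while len(chosen) < max_items:
        shortfall = [max(tg - tt, 0.0) for tg, tt in zip(target, total)]
        remaining = [i for i in range(len(pool)) if i not in chosen]
        if not remaining or not any(shortfall):
            break  # pool exhausted, or target met at every grid point
        # Gain: how much of the remaining shortfall a candidate item would fill.
        best = max(remaining,
                   key=lambda i: sum(min(x, s) for x, s in zip(info[i], shortfall)))
        chosen.append(best)
        total = [tt + x for tt, x in zip(total, info[best])]
    return chosen, total


# Hypothetical pool of (a, b, c) parameter estimates, for illustration only.
pool = [(1.2, -1.0, 0.20), (0.5, 0.0, 0.20), (1.5, 0.0, 0.15),
        (1.0, 1.0, 0.25), (0.4, -0.5, 0.30), (1.3, 0.5, 0.20)]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = [1.0] * len(grid)  # flat target information function
chosen, total = select_items(pool, grid, target, max_items=4)
print(chosen, [round(t, 2) for t in total])
```

A flat target asks for equal precision across the ability range; a peaked
target would instead concentrate precision near a cutting score.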

Item selection using the one-parameter model is typically done differently.
In conjunction with the item calibration procedure, a test of the fit of the
one-parameter model to the item data is performed, usually using some type of
squared difference between observed and expected frequencies (Wright & Stone,
1979). These squared differences are assumed to have a chi-square distribution,
and probabilities of fit are determined accordingly. The basis for item
selection is to pick items for which the model fits, while deleting the rest.
In theory, three violated assumptions contribute to lack of fit with this
model. The violations are variation in discrimination, nonzero guessing, and
multidimensionality. Thus selecting items on the basis of one-parameter
logistic model fit should yield tests with low guessing, moderate and equal
discrimination, and unidimensionality.
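
As a rough illustration of a fit statistic of this general type, the sketch
below sums standardized squared differences between observed and expected
numbers correct over ability groups. It is not the exact statistic of Wright &
Stone (1979); the grouping, the function names, and the data are hypothetical,
and the degrees of freedom to use depend on the particular procedure.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct response under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit_chi_square(b, groups):
    """Sum of (observed - expected)^2 / variance over ability groups.

    groups: list of (theta, n_examinees, n_correct), one tuple per ability group.
    """
    chi_square = 0.0
    for theta, n, n_correct in groups:
        p = rasch_p(theta, b)
        expected = n * p
        variance = n * p * (1.0 - p)
        chi_square += (n_correct - expected) ** 2 / variance
    return chi_square


# Hypothetical item with difficulty 0.0; the low group answers correctly more
# often than the model expects, as might happen with guessing.
groups = [(-1.5, 100, 30), (0.0, 100, 52), (1.5, 100, 83)]
print(round(item_fit_chi_square(0.0, groups), 2))
```

In this made-up example most of the statistic comes from the lowest ability
group, which is the pattern one would expect when guessing is present.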

A number of studies have tried to determine whether selecting on fit will
really yield tests with the above characteristics. Reckase (1979), in a study
of 150 test items from a series of classroom tests, found lack of fit to be
strongly related to guessing level. Brooks (1964) found that selecting items
that were fit by the model yielded lower test reliabilities, as measured by the
KR-20 formula, than did traditional methods. Hambleton (1969) and Panchapakesan
(1969) used simulation procedures in their dissertations to determine the effect
of variation in discrimination and guessing on fit. They found that items that
differed from the average discrimination parameter by more than .20 tended to
cause lack of fit. Guessing was also found to have a strong influence on fit.
Thus, variation in discrimination and nonzero guessing do seem to have something
to do with lack of fit, but avoiding violations of the assumptions did not seem
to improve reliability. In some cases (Anderson, Kearney & Everett, 1968), lack
of fit did not seem to be related to any item characteristic.

One possible cause of the ambiguous results concerning the selection of
items on the basis of model fit is the basic inadequacy of the chi-square fit
statistic. Forster & Karr (1980) summarize these inadequacies as follows:

"First, it is sensitive to differences in sample size:
generally low for small samples, generally high for large
samples. Second, it disproportionately weights differences
near the top or bottom of the item curve. Finally,
it does not provide an adequate indication of where the
actual and ideal distributions do not fit."

As an alternative, they suggest looking at the empirical and theoretical item
characteristic curves to check fit. Our own experience has shown the fit
statistics to be relatively meaningless. Sometimes more than half of the items
were found not to be fit by the model (e.g., 29 out of 50 items from the Iowa
Tests of Educational Development with a sample of 2,000).

To summarize these results, there seems to be no good procedure for
selecting items with the one-parameter logistic model. Not only do the fit
statistics not work well, but no reason can be thought of for selecting items
with discrimination parameters equal to the mean discrimination in the pool.
Typically, use of the best items in a pool would seem desirable, as opposed to
using the mediocre items as suggested by selection on the basis of
one-parameter model fit.

After the items are selected from the item pool for use on the forms of a
test, an overall measure of test quality is usually determined. Traditionally,
a test reliability coefficient is computed based on the tryout sample.
Unfortunately, this type of statistic is highly sample dependent and only
yields an average measure of precision over the full range of ability measured
by the test. The test information function (Birnbaum, 1968) used with the
latent trait models has substantial advantages over the reliability coefficient
because it gives an indication of test precision at all points of the ability
scale, and because it is somewhat sample independent. Since the information
function can be defined for either the one- or three-parameter models, its
availability does not give an advantage to either model, although the
availability of the information function does argue for the use of latent trait
models. The three-parameter model does tend to give higher values for the
information function because of the ability to select on the basis of the
discrimination and guessing parameters (Koch & Reckase, 1978).
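
To make the contrast with a single reliability coefficient concrete, the
sketch below computes the standard error of the ability estimate at several
ability levels as the reciprocal square root of the test information. It
assumes the one-parameter model in the logit metric (no scaling constant), and
the item difficulties are hypothetical.

```python
import math


def rasch_item_information(theta, b):
    """Item information under the one-parameter model: P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)


def standard_error(theta, difficulties):
    """Approximate standard error of the ability estimate: 1 / sqrt(test information)."""
    information = sum(rasch_item_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(information)


# Hypothetical 16-item test with all difficulties at 0.0.
difficulties = [0.0] * 16
for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(theta, round(standard_error(theta, difficulties), 3))
```

The precision is best where the item difficulties are concentrated and falls
off toward the extremes, which a single reliability coefficient cannot show.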

One other concept that should be kept in mind during the process of item
selection is the content validity of the test formed. Most achievement tests
are produced using a table of specifications to help ensure content validity.
This procedure does not usually ensure that the test produced will have one
dimension, a basic assumption of the latent trait models. The use of a number
correct score ignores this issue by simply summing the varied content areas in
whatever proportion they happen to appear in the test. The one-parameter model
treats the possible multidimensionality in the same way, since the raw score is
a sufficient statistic for the one-parameter ability estimate. The
three-parameter model yields an ability estimate with a different
interpretation because its ability estimates are based on a weighted sum of
item scores, with the weights being the discrimination parameter estimates.
This weighting procedure has the effect of emphasizing the items measuring one
factor in the test while ignoring the others. Thus, for the most part, the
three-parameter model ability estimates do not contain information from every
component in the test. They only emphasize the largest component (see Reckase,
1979, for a more thorough discussion). This should be kept in mind in selecting
the model to be used in obtaining test scores.
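
The difference between the two kinds of scores can be illustrated with a small
sketch. Under the one-parameter model, two response patterns with the same
number correct receive the same maximum-likelihood ability estimate, while a
discrimination-weighted sum of item scores, as described above, separates them.
The estimation routine, the function names, and all parameter values are
hypothetical and serve only to illustrate the point.

```python
import math


def rasch_mle(responses, difficulties, iterations=50):
    """Newton-Raphson maximum-likelihood ability estimate, one-parameter model.

    Only the raw score enters the computation. Zero and perfect scores have no
    finite estimate and are not handled here.
    """
    theta = 0.0
    raw_score = sum(responses)
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        gradient = raw_score - sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        theta += gradient / information
    return theta


def weighted_score(responses, discriminations):
    """Sum of item scores weighted by the discrimination parameter estimates."""
    return sum(a * u for a, u in zip(discriminations, responses))


# Hypothetical five-item test and two patterns with the same raw score of 2.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
discriminations = [0.5, 0.8, 1.0, 1.2, 1.5]
pattern_a = [1, 1, 0, 0, 0]
pattern_b = [0, 0, 0, 1, 1]

print(rasch_mle(pattern_a, difficulties), rasch_mle(pattern_b, difficulties))
print(weighted_score(pattern_a, discriminations),
      weighted_score(pattern_b, discriminations))
```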
Vertical Equating

When a multi-level achievement battery (a battery with forms at several
grade levels) is produced, it is usually desirable to equate the scores on
the various levels so that scores on different levels can be compared.
Traditionally, some type of equipercentile approach was used to equate the test
levels. That is, tests at two different levels are administered to the same
group and scores with the same percentile rank in the group are said to be
equivalent. Angoff (1971) does a good job of summarizing the traditional
procedures for equating.
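
A bare-bones version of the equipercentile idea might look like the sketch
below. It is only an illustration under simplifying assumptions: midpoint
percentile ranks, linear interpolation between observed scores, and no
smoothing. The function names and score lists are hypothetical; Angoff (1971)
should be consulted for the procedures actually used in practice.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, scores):
    """Percent of the group below the score plus half of those at the score."""
    ordered = sorted(scores)
    below = bisect_left(ordered, score)
    at = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * at) / len(ordered)


def equipercentile_equate(score_x, scores_x, scores_y):
    """Score on test Y with the same percentile rank as score_x has on test X."""
    target = percentile_rank(score_x, scores_x)
    candidates = sorted(set(scores_y))
    ranks = [percentile_rank(y, scores_y) for y in candidates]
    if target <= ranks[0]:
        return float(candidates[0])
    if target >= ranks[-1]:
        return float(candidates[-1])
    # Interpolate linearly between the two Y scores whose ranks bracket the target.
    for i in range(len(candidates) - 1):
        if ranks[i] <= target <= ranks[i + 1]:
            fraction = (target - ranks[i]) / (ranks[i + 1] - ranks[i])
            return candidates[i] + (candidates[i + 1] - candidates[i]) * fraction


# Hypothetical scores of one group on two test levels.
scores_x = [1, 2, 3, 4, 5]
scores_y = [11, 12, 13, 14, 15]
print(equipercentile_equate(3, scores_x, scores_y))
```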

In recent years latent trait theory approaches have been suggested as an
alternative to the time-honored procedures because of several theoretical
advantages. These advantages include the sample independent nature of item and
ability parameter estimates, the capability of getting ability estimates on the
same scale regardless of the set of items administered, and the possibility of
equating tests using common items as opposed to administering two tests to the
same sample.

Both one-parameter (Wright & Stone, 1979) and three-parameter (Marco, 1977)
logistic model based procedures have been developed for vertical equating, but
only recently have these procedures been evaluated. Both models typically use
a procedure that has items in common between the tests to be equated. The items
in the tests are then calibrated separately, and a linear equation is
determined to translate the item difficulty estimates for the common items from
one calibration to the other. This same transformation is used to translate the
ability scale of one test to the scale of the other. Alternatively, two tests
can be administered to the same sample of people and the linear transformation
can be found to equate the ability estimates.
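
One simple way to obtain such a linear equation from the common items is to
match the mean and standard deviation of their difficulty estimates across the
two calibrations. This mean and sigma approach is an assumption made for the
sketch below; it is not necessarily the method used in the studies cited here,
and the function names and difficulty values are hypothetical.

```python
from statistics import mean, pstdev


def linking_constants(b_from, b_to):
    """Slope and intercept that map common-item difficulties from one scale to another."""
    slope = pstdev(b_to) / pstdev(b_from)
    intercept = mean(b_to) - slope * mean(b_from)
    return slope, intercept


def transform(values, slope, intercept):
    """Apply the linear transformation to difficulty or ability estimates."""
    return [slope * v + intercept for v in values]


# Hypothetical difficulty estimates for three common items in two calibrations.
b_easy_form = [-1.0, 0.0, 1.0]
b_hard_form = [-0.5, 1.5, 3.5]
slope, intercept = linking_constants(b_easy_form, b_hard_form)

# The same constants move ability estimates from the first scale to the second.
print(slope, intercept, transform([0.2, 0.8], slope, intercept))
```

With the one-parameter model the slope is usually taken to be 1, so that only
a shift between the two calibrations needs to be estimated.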

Several studies have been done to evaluate the quality of the vertical
equating done using the one-parameter model (Slinde & Linn, 1978, 1979;
Gustafsson, 1979; Loyd & Hoover, 1980). Slinde & Linn (1979) equated easy and
difficult tests constructed from 60 reading and vocabulary items from the SRA
Achievement Series. Three different ability groups, high, middle, and low,
formed on the basis of Comprehensive Tests of Basic Skills scores were used for
the calibration. They found that "For extreme comparisons which involve widely
separated groups and tests of substantial different difficulties, the Rasch
model does not seem to result in adequate vertical equating of existing tests."
They felt that guessing was a major cause of the poor results. Despite the
poor results, they felt that the one-parameter model would work reasonably well
with groups that are closer in ability and tests closer in difficulty.

Loyd and Hoover (1980) also obtained negative results when evaluating
vertical equating using the one-parameter model. Their study was somewhat more
realistic than the Slinde & Linn (1979) study because they used three existing
test levels of the Iowa Tests of Basic Skills instead of constructing easy and
difficult tests specifically for the study. They also used sixth, seventh,
and eighth grade groups rather than forming different ability groups on the
basis of test scores. Their results showed that equating using the
one-parameter model was "inconsistent." The equated scores tended to be higher
than the scores that would be obtained if the original test had been taken.
In an example of translating scores from one level to another, they
demonstrated that the differences could be quite large. On the basis of their
results, they conclude that "While latent trait methods show a great deal of
promise for improving horizontal equating [linking] of tests, results of the
present study, and others, indicate that the use of the Rasch model in
vertical equating should be approached with extreme caution." Loyd & Hoover
(1980) feel that the major cause of the equating problems is the change of
course content with the change in test level.

The results of these studies on vertical equating certainly do not seem 
positive for the one-parameter model. Unfortunately, similar evaluations of 
the vertical equating procedures based on the three-parameter model do not 
yield much better results and are much harder to find. Three relatively ob- 
scure studies could be found that evaluated three-parameter based vertical 
equating. Kolen (1981) compared equipercentile, three-parameter, and one- 
parameter based procedures on the basis of consistency in a cross-validation 
study and found that the equipercentile and three-parameter based procedures 
worked best, while the Rasch model procedure worked poorest. 

Patience (1981), in an unpublished paper, reported slightly different
results. He compared the equated score scale from an easy, middle, and hard
test with the score scale obtained from the three tests administered together
as one. The three tests were produced from the verbal subtests of the Iowa
Tests of Educational Development in such a way that items were in common
between the tests. Patience (1981) found that the equipercentile procedure
worked best, followed by the one-parameter procedure and then the
three-parameter procedure. He attributed the poor showing of the
three-parameter model to unstable estimates of the item parameters despite a
relatively large sample size (1,000).

Marco, Petersen & Stewart (1980), in a very complicated and elaborate
study, evaluated the equating of easy, medium and hard tests formed from the
verbal subtest of the Scholastic Aptitude Test using samples of 1,577 cases.
They used a series of statistics based on the difference between actual and
equated scores. Five different types of equating procedures were used,
including procedures based on the one- and three-parameter logistic model. Their
results indicated that "For most equatings, the model with the smallest total
error was the 3-parameter ICC model." The one-parameter model did relatively
poorly in this study. Equipercentile equating was the next best procedure
after the three-parameter model based procedures.

The overall trend of the results reported here seems to indicate that the
use of the one-parameter model results in serious problems when it is applied
to vertical equating, while the value of the three-parameter model is yet to
be determined in this area because of the varied results available. The safest
conclusion might be to recommend equipercentile based procedures, since they
worked well in both the Kolen (1981) and Patience (1981) studies.



Score Scales 

The underlying purpose of vertically equating the set of tests in a
multi-level battery is to form a single score scale to which all scores on all
test levels can be transformed. Such a multi-level score scale will allow
comparisons to be made across all test levels. Tracing the improvement in
performance of a student is one possible application of such a scale. In
theory, the development of such a scale should be easily accomplished on the
basis of item characteristic curve models, since any set of items can be used
to get scores on the same scale once all of the item calibrations have been
linked. Unfortunately, as described in the previous section, the use of latent
trait models for vertical equating cannot currently be considered an
acceptable procedure.

With the extensive research being done with latent trait models, there is
hope that the problems in vertical equating will be solved in the future.
Therefore it is important to look at the implications of the use of latent
trait models for the interpretation of the resulting multi-level score scale.
The necessity of the analysis of the possible scales is further motivated by
the difference in the meaning of ability estimates obtained through the use of
the one-parameter and three-parameter logistic models. As mentioned earlier,
the one-parameter model yields ability estimates that have meaning equivalent
to that of the raw scores, but that are on a transformed scale. This results in
an ability estimate that is based on the sum of the various components of the
test, where the components are weighted by the number of items measuring the
component. The three-parameter model yields an ability estimate with a
different interpretation. Since the estimates are based on a weighted sum of
item responses, the weights being the item discrimination parameters, and since
the discrimination parameters are related to the first factor loading of a
test (Lord & Novick, 1968), the three-parameter based ability estimates are
only related to the largest component of the test.

In many cases, this distinction in the interpretation of the score scale
produced by the latent trait models will have little practical significance.
Reckase (1979) has found that for many tests there is a dominant first factor.
The one-parameter and three-parameter ability estimates then correlate in the
high .90's. However, if the dimensions present in the tests contained in a
multi-level battery change with the test level, as suggested by Loyd & Hoover
(1980), the ability scales defined by the two procedures might be quite
different. If, for example, the major factor in the tests at the various levels
remained the same, but the actual proportion of various content areas changed,
the three-parameter logistic based estimates would maintain the same meaning,
while the one-parameter estimates would change in meaning with the level tested.
Of course, the same change in meaning would also occur in the raw scores.
On the other hand, if the major component of the test changed across levels,
both procedures would result in ability estimates that had different meanings
at the different levels. Thus, changes in the dimensionality of the test
battery at the different levels can be seen to be a critical issue in score
scale development. The same problem plagues procedures based on raw scores and
equipercentile methods, but it has been mostly ignored until recently because
the assumptions of the methods were not clearly stated.

Discussion 

In the course of this paper, an attempt has been made to define the issues
related to the use of latent trait models in conjunction with multi-level
achievement batteries, and to summarize some of the research results related to
those issues. The issues identified included (a) the use of the models in test
construction, (b) the use of the models for vertical equating, and (c) the
comparability of the score scales obtained using the models. The one- and
three-parameter logistic models were concentrated on in the summary because the
majority of work has been done with these two models.

In the area of test construction, the availability of the concepts of item
and test information gives the latent trait models a clear advantage over
traditional methods. Target information functions can give much greater control
over test construction, putting test precision where it is desired. Also, the
information function indicates the precision at each ability level.

For some reason the concept of the information function has been embraced
by users of the three-parameter model, while being largely ignored by the users
of the one-parameter model. The result is that there seems to be no acceptable
way to select items for a test using the one-parameter model. The available
research seems to indicate that selecting on fit is not an adequate procedure,
and since all of the items are assumed to have equal discrimination and
no guessing, the only source of selection information is the estimate of item
difficulty. The one-parameter model perpetuates the myth that all items are
equal, an idea that has been accepted as long as raw scores have been used.
This is a serious problem with the use of the one-parameter model.

The one-parameter model does have the advantages of simplicity of esti- 
mation and smaller sample size requirements than the three-parameter model. 
But this should not be a major issue for large group tests. Adequate numbers 
of cases for the requirements of the three-parameter model have been used for 
the analysis of the tests in the past. 

The issue of how to vertically equate the tests in a multi-level battery
is a serious one, and not only because of the use of latent trait models. The
major consideration is whether the tests have the same content and emphasis for
the content at all of the levels. If not, the meaning of the ability estimates
obtained from the tests changes over levels, and none of the equating
procedures available is appropriate.
Under the circumstances, the best procedure would be to sort the battery
into unidimensional subtests that measure the same dimension at all difficulty
levels. These subtests could then reasonably be equated over levels unless all
levels measured the various different content areas in exactly the same
proportions.

Even under the above ideal circumstances, the one-parameter model does not
seem to be adequate for vertical equating purposes. The Slinde & Linn (1979)
and Loyd & Hoover (1980) papers show that the techniques being used have serious
problems. Unfortunately, the three-parameter logistic based procedures have
not been consistently demonstrated to be any better. At this point the
equipercentile based procedures seem to be the most acceptable.

If unidimensional subtests can be formed and equated, the score scale issue
is not a serious one. Either model can be used since in the unidimensional case
the models will yield ability estimates that are correlated .95+. Since the
one-parameter model is cheaper to use, it should probably be selected.



Recommendations 

It is unfortunate that on the basis of the data presented, no clear choice
can be made. The three-parameter model based procedures seem to be more
desirable because of the better item selection procedures, the possibly better
vertical equating procedures, and some possible advantages in the meaning of
the ability estimates. But these same procedures require larger sample sizes,
are more complex, and sometimes yield unstable parameter estimates. The
one-parameter model has the advantages of simplicity and smaller sample size
requirements.

Perhaps the solution is to avoid taking sides in the model selection
controversy and suggest that both models be used in conjunction with traditional
procedures. The tests in the battery could be designed using target information
functions and items could be selected using linked calibration results from the
three-parameter logistic model. By selecting the highly discriminating items,
the resulting tests would have dominant first factors that were common to the
tests at all levels, making vertical equating reasonable.

The vertical equating is implicit in the linking of the item pool of the
test, since transforming the item difficulty parameters is the same as
transforming the ability scale. However, due to the uncertainty of the value of
the three-parameter vertical equating procedure, it could be verified using
traditional equipercentile equating. Finally, the multi-level score scale could
be based on the one-parameter model, because by then the items selected would
have reasonably similar discrimination parameter estimates and low guessing
parameter estimates. The resulting score scale would then reflect all of the
content areas in the tests.